Language identification: insights from the classification of hand annotated phone transcripts
Abstract
Language Identification (LID) of speech can be split into two processes: phone recognition and language modelling. This two-stage approach underlies some of the most successful LID systems. As phone recognizers become more accurate, it is useful to simulate a very accurate phone recognizer to determine its effect on overall LID accuracy. This can be done by using hand-annotated phone transcripts in place of recognizer output. In this paper, LID is performed on phone transcripts from six languages in the OGI multi-language telephone speech corpus. By simulating a phone recognizer that classifies phones into ten broad classes, a simple n-gram model gives low LID equal error rates (EER) of <1% on 30 seconds of test data. Language models based on these accurate phone transcripts can also reveal insights into the phonology of different languages.
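The pipeline described in the abstract (collapse phones into broad classes, train one n-gram model per language, pick the language whose model scores the test transcript highest) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `BROAD_CLASS` map, the add-one smoothing, the bigram order, and the toy transcripts for the made-up languages `lang_a` and `lang_b` are all assumptions, since the abstract does not list the ten classes, the smoothing scheme, or the n-gram order.

```python
import math
from collections import Counter

# Hypothetical phone-to-broad-class map (illustrative only; the paper's
# ten broad classes are not given in the abstract).
BROAD_CLASS = {
    "p": "stop", "t": "stop", "k": "stop", "b": "stop", "d": "stop", "g": "stop",
    "s": "fric", "z": "fric", "f": "fric", "v": "fric",
    "m": "nasal", "n": "nasal",
    "l": "liquid", "r": "liquid",
    "a": "vowel", "e": "vowel", "i": "vowel", "o": "vowel", "u": "vowel",
}
# Predictable symbols: every broad class, plus "other" and the end marker.
VOCAB_SIZE = len(set(BROAD_CLASS.values())) + 2


def to_classes(phones):
    """Map a phone transcript to its broad-class sequence."""
    return [BROAD_CLASS.get(p, "other") for p in phones]


def ngrams(seq, n):
    """Return the n-grams of seq, padded with start and end markers."""
    padded = ["<s>"] * (n - 1) + list(seq) + ["</s>"]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]


class NgramModel:
    """Add-one smoothed n-gram model (the smoothing choice is an assumption)."""

    def __init__(self, n=2):
        self.n = n
        self.ngram_counts = Counter()
        self.context_counts = Counter()

    def train(self, sequences):
        for seq in sequences:
            for gram in ngrams(seq, self.n):
                self.ngram_counts[gram] += 1
                self.context_counts[gram[:-1]] += 1

    def log_prob(self, seq):
        """Log-likelihood of seq under the smoothed model."""
        total = 0.0
        for gram in ngrams(seq, self.n):
            num = self.ngram_counts[gram] + 1
            den = self.context_counts[gram[:-1]] + VOCAB_SIZE
            total += math.log(num / den)
        return total


def train_models(transcripts_by_language, n=2):
    """Train one broad-class n-gram model per language."""
    models = {}
    for lang, transcripts in transcripts_by_language.items():
        model = NgramModel(n)
        model.train([to_classes(t) for t in transcripts])
        models[lang] = model
    return models


def identify(models, phones):
    """Pick the language whose model gives the highest log-likelihood."""
    classes = to_classes(phones)
    return max(models, key=lambda lang: models[lang].log_prob(classes))


# Toy transcripts for two made-up languages (not data from the OGI corpus).
TOY_DATA = {
    "lang_a": [list("takami"), list("nosuke")],   # mostly consonant-vowel
    "lang_b": [list("strand"), list("klaspt")],   # consonant clusters
}
models = train_models(TOY_DATA)
print(identify(models, list("mato")))
print(identify(models, list("split")))
```

Scoring every language model on the same class sequence and taking the maximum is the simplest decision rule; computing an equal error rate as in the abstract would additionally require thresholding these scores over many test utterances.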
Similar resources
Mobile, L2 vocabulary learning, and fighting illiteracy: A case study of Iranian semi-illiterates beyond transition level
As mobile learning simultaneously employs both handheld computers and mobile telephones and other devices that draw on the same set of functionalities, it throws open the door for swift connection between learners and teachers. This study examined and articulated the impact of the application of mobile devices for teaching English vocabulary items to 123 Iranian semi-illitera...
Incorporating Cognitive Linguistic Insights into Classrooms: the Case of Iranian Learners’ Acquisition of If-Clauses
Cognitive linguistics gives the most inclusive, consistent description of how language is organized, used and learned to date. Cognitive linguistics contains a great number of concepts that are useful to second language learners. If-clauses in English, on the other hand, remain intriguing for foreign language learners to struggle with, due to their intrinsic intricacies. EFL grammar books are ...
The Role of Disfluencies in Topic Classification of Human-Human Conversations
We investigate the impact of disfluencies on the task of classifying natural human-human conversations into topics. Disfluencies are distinctive to spoken language, and their effect on a number of spoken language understanding tasks, including spoken language classification, remains largely unknown. We use a subset of Switchboard-I annotated for disfluencies and topics, and investigate the effe...
Creation of an Annotated German Broadcast Speech Database for Spoken Document Retrieval
In this paper we present a semi-automatic method for creating annotated data sets from German-language broadcast resources for which audio files as well as transcripts are available on the Internet. The transcripts are required to be reasonably accurate, but not perfect. Our approach is implemented by an integrated bundle of data processing tools, which support the human annotator in the creatio...
Core Units of Spoken Grammar in Global ELT Textbooks
Materials evaluation studies have constantly demonstrated that there is no one fixed procedure for conducting textbook evaluation studies. Instead, the criteria must be selected according to the needs and objectives of the context in which evaluation takes place. The speaking skill as part of the communicative competence has been emphasized as an important objective in language teaching. The pr...